R is a statistical programming environment; together with its extensions, it enables data manipulation, analysis, graphing and article writing within the same powerful scientific tool. Used correctly, R can save weeks of tedious data handling compared to commonly used alternatives. Learning R, however, can be a daunting task. The introduction to R course framework developed by Dr. Mikko Vihtakari consists of one-day courses that build upon each other. These courses can be offered flexibly, tailored to the needs of the audience.
This is a handout for Graphing and linear models in R, the second course in the series. You can ask the author to give any part of these lectures, or other tailored workshops, at your institute. The course consists of four lectures:
Course description: R is a powerful graphical tool offering the freedom to design virtually any type of scientific graphics. During this one-day course, you will learn the basics of R graphing using both the base graphics and ggplot2, a widely used extension to R's graphing capabilities. Further, the course goes through basic linear models within the lm framework in R. The course is aimed at R beginners, at those who are not familiar with ggplot2, and at those who want to learn about linear models in R. If you are new to R, taking the R grammar course before enrolling in this course is highly recommended.
R is a powerful graphing tool that can be used for virtually any kind of plot, from simple graphs to maps or 3D teapots. The basic graphics commands produce rather ugly plots, but the possibilities to add your own features to plots are almost endless. The R Graph Gallery contains many examples of plots made with R.
During this lecture, you will learn how to produce simple plots with R’s inbuilt base graphics system, as well as with an alternative called ggplot2, which makes advanced graphs easier.
We use a dataset from Vihtakari et al. 2017 as an example. This article is about black-legged kittiwake diet species in Kongsfjorden from 1982 until 2016. The data should be available in the Data folder.
Place the files (with 2 in front of both file names, as in the script below) in the Data folder in your project directory and run the script below.
dt <- read.csv("Data/2 kittiwake_diet_bin.csv")
env <- read.csv("Data/2 explanatory_data.csv")
Once the data have been successfully imported (i.e. you won’t get any errors), we manipulate the data into wide and long formats. The script contains some more advanced commands covered during the Advanced data manipulation course.
# Load required packages
library(tidyverse)
library(reshape2)
# Modify the datasets
mx <- reshape2::melt(dt, id = 1:7)
mx <- merge(mx, env, by = "year", all.x = TRUE)
## Frequency of occurrence (FO) in wide format
wide.dat <- reshape2::dcast(mx, year + ice ~ variable, function(x) 100*mean(x))
## FO and number of samples / year in long format
long.dat <- mx %>%
dplyr::group_by(year, ice, variable) %>%
dplyr::summarise(fo = 100*mean(value), n = length(value))
Tabular data can be presented in wide or long (= narrow) format. Some R functions use data in a wide format, whereas others use a long format. As a generalization, base graphics functions often use the wide format, whereas ggplot2 more often uses the long format. This is not always the case, as you will soon see: the appropriate format depends on what you want to plot. There are many ways to switch between the formats, some of which you will learn during the Advanced data manipulation course.
head(round(wide.dat[1:8],2)) # 8 first cols and 6 first rows of wide data
## year ice polar_cod capelin herring glacier_lanternfish cod barracudina
## 1 1982 20.54 85.71 0 0 0 0 0
## 2 1983 18.82 85.71 0 0 0 0 0
## 3 1984 12.75 20.00 0 0 0 0 0
## 4 1987 9.08 27.27 0 0 0 0 0
## 5 1997 9.70 71.43 0 0 0 0 0
## 6 1998 14.28 72.73 0 0 0 0 0
head(long.dat) # 6 first rows of long data
## # A tibble: 6 x 5
## # Groups: year, ice [1]
## year ice variable fo n
## <int> <dbl> <fct> <dbl> <int>
## 1 1982 20.5 polar_cod 85.7 7
## 2 1982 20.5 capelin 0 7
## 3 1982 20.5 herring 0 7
## 4 1982 20.5 glacier_lanternfish 0 7
## 5 1982 20.5 cod 0 7
## 6 1982 20.5 barracudina 0 7
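As a minimal sketch of switching between the two formats (using a hypothetical toy data frame rather than the kittiwake data), reshape2 converts in both directions:

```r
library(reshape2)

# Hypothetical toy data in wide format: one row per year,
# one column per species
toy.wide <- data.frame(year = c(1982, 1983),
                       polar_cod = c(85.7, 85.7),
                       capelin = c(0, 12.5))

# Wide -> long: each year-species combination becomes its own row
toy.long <- reshape2::melt(toy.wide, id = "year",
                           variable.name = "species", value.name = "fo")

# Long -> wide: back to one column per species
toy.wide2 <- reshape2::dcast(toy.long, year ~ species, value.var = "fo")
```

The same melt/dcast pair is what the data preparation script above applies to the real data.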
R has an integrated graphics system (the graphics package), which is highly customizable. The plot function is the generic function for plotting in R. The function requires the user to specify the data and adjust parameters.
The base graphics package contains functions that extend the capabilities of the plot function for easier production of popular graph types. Chapter 5 in Logan’s Biostatistical Design and Analysis using R summarizes the base graphics features nicely.
Exercise E1.1
Make a scatter plot using the base graphics system, help sheets (?plot), Logan’s book, and the provided cheat sheets. Use polar cod frequency of occurrence on the y-axis and sea-ice index from the wide dataset (wide.dat) on the x-axis. The result should look as follows:
As you can see, the outcome does not look very nice; it can be adjusted using graphical parameters. The list of graphical parameters can be found online or in the inbuilt help sheets (?par).
Exercise E1.2
Now use the above-mentioned sources and the list of graphical parameters (?par) to add “Sea ice index” as x-axis label, “Polar cod frequency of occurrence (%)” as y-axis label, and red data points.
Adding trendlines to base graphics has to be done manually, by first fitting a linear model between the polar cod frequency of occurrence (y) and the sea ice index (x). You will learn how to make linear models (i.e. regression analysis in this case) during the Linear models and graphics II lecture. If you want to add a trendline to your plot, you can do it with the abline function:
plot(wide.dat$ice, wide.dat$polar_cod, xlab = "Sea ice index", ylab = "Polar cod frequency of occurrence (%)", col = "red")
abline(lm(polar_cod ~ ice, data = wide.dat))
Figure 1: Base graphics exercise. Scatter plot and linear regression showing the polar cod frequency of occurrence against sea ice index.
There are several ways to export figures:
You can export R graphics on A4-sized paper using the following code:
pdf(file="example.pdf", width = 0, height = 0, paper = "a4", useDingbats = FALSE) # size in inches. 0 uses A4
plot(wide.dat$ice, wide.dat$polar_cod, xlab = "Sea ice index", ylab = "Polar cod frequency of occurrence (%)", col = "red")
abline(lm(polar_cod ~ ice, data = wide.dat))
dev.off() # This closes the pdf "device", important!
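If you need a raster format instead, the same device logic applies. A sketch using the png device (the 300 dpi resolution and the dimensions are example values, not a journal requirement):

```r
png(filename = "example.png", width = 84, height = 84,
    units = "mm", res = 300) # res sets the resolution in dpi
plot(wide.dat$ice, wide.dat$polar_cod,
     xlab = "Sea ice index",
     ylab = "Polar cod frequency of occurrence (%)", col = "red")
dev.off() # remember to close the device here too
```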
The base graphics in R require quite some manual work to produce nice-looking graphics. While this might not be a problem for scientific articles, quickly producing and changing good-looking graphics is easier with Hadley Wickham’s ggplot2 package. The workflow and syntax of ggplot2 (ggplot for short) differ somewhat from base graphics, but once you learn the syntax, quick graph making is much easier with ggplot than with base graphics.
Ggplot implements a graphing philosophy called the grammar of graphics, presented in the book of the same name by Leland Wilkinson (1999). Like any grammar, the philosophy follows a relatively strict (and sometimes incomprehensible) formula to represent data visually and scientifically in the best and most honest way. The idea is to think of the different graphic elements as interchangeable layers (Figure 2).
Figure 2: The idea behind grammar of graphics is to layer different elements of plotting in a consistent way. Source: https://www.science-craft.com/2014/07/08/introducing-the-grammar-of-graphics-plotting-concept/
Ggplot comes with an excellent visual function reference website. It is therefore better to keep your internet browser open when working with ggplot instead of using the help sheets (which contain the same information but do not show the example graphics).
Ggplot syntax begins with the ggplot() function, and you add layers to the plot with the + operator. In the first part of the command, you specify the dataset and the aesthetics:
ggplot(data, aes(x = x_variable, y = y_variable))
After this you can add the geometry layers you wish to use. For example, the basic syntax for the plot we made with base R would look as follows:
library(ggplot2)
ggplot(wide.dat, aes(x = ice, y = polar_cod)) + geom_point()
You can quickly add layers of specifications to ggplot graphics:
ggplot(wide.dat, aes(x = ice, y = polar_cod)) +
geom_point() +
geom_smooth(method = "lm") +
xlab("Sea ice index") +
scale_y_continuous(name = "Polar cod frequency of occurrence (%)",
breaks = seq(0, 100, by = 20)) +
coord_cartesian(ylim = c(0,100), xlim = c(0,21), expand = FALSE) +
theme_classic()
Unlike base graphics, ggplot graphics can be assigned to objects, updated and reused:
p <- ggplot(wide.dat, aes(x = ice, y = polar_cod)) +
geom_point() +
geom_smooth(method = "loess") +
xlab("Sea ice index") +
ylab("Polar cod frequency of occurrence (%)") +
scale_y_continuous(breaks = seq(0, 100, by = 20)) +
coord_cartesian(ylim = c(0,100), xlim = c(0,21), expand = FALSE) +
theme_classic()
p # plots object p
p %+%
aes(y = capelin) +
ylab("Capelin frequency of occurrence (%)") # plots capelin FO instead
The base R graphics exporting tricks can also be used for ggplot graphics. In addition, single graphs can be exported from ggplot2 with the ggsave function. A graph can either be saved from an object, or ggsave uses the previous plot in the graphics window. In the following example, we save the object p as a .png file. Other file types (pdf, svg, tiff, jpeg) are also supported. The width, height and scale arguments can be used to tweak the size of the plot: scale < 1 creates larger point and font sizes. The dimensions of your open plotting window are used if you do not specify width and height.
ggsave(p, filename = "ggplot_example.png", width = 4, height = 3, units = "in")
It is possible to examine a large number of relationships in the data at once using the facet_wrap and facet_grid functions in ggplot. For this purpose, we need the data in long format:
x <- long.dat[long.dat$variable %in% c("polar_cod", "capelin", "thy_ine", "the_lib"),] # subset the dataset
x <- droplevels(x) # remove unused variable levels
levels(x$variable) <- c("Polar cod", "Capelin", "T. inermis", "T. libellula") # Rename the variable levels
## facet_wrap
ggplot(x, aes(x = ice, y = fo, size = n, color = variable)) +
geom_smooth(se = FALSE) +
geom_point() +
facet_wrap(~variable) +
scale_size(name = "Number of samples", breaks = seq(0,150,30)) +
theme_bw()
## Scientific names in italic; facet_grid
levels(x$variable) <- c("Polar~cod", "Capelin", "italic(T.~inermis)", "italic(T.~libellula)") #
ggplot(x, aes(x = ice, y = fo, size = n, color = variable)) +
geom_smooth(se = FALSE) +
geom_point() +
facet_grid(~variable, labeller = label_parsed) +
scale_size(name = "Number of samples", breaks = seq(0,150,30)) +
theme_bw()
Exercise E1.3
Now that you know how to use layers and faceting in ggplot, make a bar plot that shows the polar cod, capelin, Thysanoessa inermis (thy_ine) and Themisto libellula (the_lib) frequencies of occurrence against year. Save the plot to a PDF that uses the column width requirement for Polar Biology as width and a fitting height to show your figure. A couple of tips: geom_col makes bar plots (it is geom_bar with stat = "identity"), and theme(legend.position = "none") gets rid of the legend. The plot should look like this:
Figure 3: Ggplot exercise. Kittiwake diet species frequency of occurrence from 1982 to 2016.
All statistical analyses with a continuous response variable (y) and assumption of linear relationships can be thought of as linear models. These analyses include analysis of variance (ANOVA), regression analysis and covariance analysis. All of these analyses can be performed using similar syntax and one function called lm.
What makes these analyses different in R is the type of the explanatory variables (x). Commonly associated statistical analyses and graph types for each variable type are summarized in Table 1.
| Response (y) \ Explanatory (x) | Continuous | Categorical | None |
|---|---|---|---|
| Continuous | Regression (Scatter) | ANOVA, t-test (Box, jitter, violin, pointrange, bar) | Mixture distribution, one-sample t-test (Histogram, density) |
| Categorical | Nonparametric (Box, jitter, violin, pointrange, bar) | Nonparametric (Mosaic) | One-sample rank test (Mosaic) |
Continuous response and explanatory variables call for a scatter plot and regression, which we already did during the first graphing lecture.
Exercise E2.1
Try making a regression analysis between the polar cod frequency of occurrence and the sea ice index using the wide.dat dataset. Remember to check the class of the data columns. Once you have done the analysis, try to plot it together with the data.
# note that both x & y have to be numeric for R to perform a regression
str(wide.dat[c("polar_cod", "ice")])
## 'data.frame': 19 obs. of 2 variables:
## $ polar_cod: num 85.7 85.7 20 27.3 71.4 ...
## $ ice : num 20.54 18.82 12.75 9.08 9.7 ...
# the analysis
mod <- lm(polar_cod ~ ice, data = wide.dat)
mod # model coefficients
##
## Call:
## lm(formula = polar_cod ~ ice, data = wide.dat)
##
## Coefficients:
## (Intercept) ice
## 20.818 3.344
summary(mod) # model statistics
##
## Call:
## lm(formula = polar_cod ~ ice, data = wide.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.453 -10.429 1.959 6.749 43.982
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.8179 7.7480 2.687 0.015596 *
## ice 3.3438 0.7894 4.236 0.000557 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.77 on 17 degrees of freedom
## Multiple R-squared: 0.5135, Adjusted R-squared: 0.4848
## F-statistic: 17.94 on 1 and 17 DF, p-value: 0.0005571
Using the plot function on linear model objects prints graphs for evaluating the model assumptions. In the case of simple linear models, figuring out what your model looks like graphically is no issue (in this case we already know what it looks like), but when working with more complicated models, seeing the model is helpful for non-mathematicians. The effects package can plot pretty much any linear, and nonlinear, model you manage to come up with:
library(effects)
plot(effects::allEffects(mod))
As said earlier, using the plot command prints graphs that allow the user to evaluate model fit and assumptions:
par(mfrow = c(2,2))
plot(mod)
There is also a package called gvlma, which allows the user to quickly test linear model assumptions:
library(gvlma)
gvlma(mod)
##
## Call:
## lm(formula = polar_cod ~ ice, data = wide.dat)
##
## Coefficients:
## (Intercept) ice
## 20.818 3.344
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = mod)
##
## Value p-value Decision
## Global Stat 2.256601 0.6887 Assumptions acceptable.
## Skewness 0.001364 0.9705 Assumptions acceptable.
## Kurtosis 0.052141 0.8194 Assumptions acceptable.
## Link Function 0.478804 0.4890 Assumptions acceptable.
## Heteroscedasticity 1.724292 0.1891 Assumptions acceptable.
E2.2 Examine whether ice concentration correlates with the capelin frequency of occurrence in the kittiwake diet data from Kongsfjorden using what you learned above.
A continuous response variable and a categorical explanatory variable call for a t-test if the number of levels in the explanatory variable is two, and for an ANOVA if the number of levels is more than two.
To demonstrate tests for categorical explanatory variables, we use the wet weight dataset from Vihtakari et al. 2017. The data should be available in the Data folder.
Place the file (with 2 in front of the file name, as in the script below) in the Data folder in your project directory and run the script below.
ww <- read.csv("Data/2 kittiwake_diet_ww.csv")
ww <- ww[c("year", "age", "stage", "colony", "total_mass")]
T-tests are a special case of linear models with a categorical explanatory variable (ANOVAs and related methods). The number of groups in t-tests has to be either one or two. In the case of one group, a difference from a certain value will be tested. Here we look at two-group (= level) t-tests:
table(ww$age)
##
## adult chick
## 715 164
mod <- t.test(total_mass ~ age, ww)
mod
##
## Welch Two Sample t-test
##
## data: total_mass by age
## t = 7.6419, df = 392.66, p-value = 1.656e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 4.799478 8.124367
## sample estimates:
## mean in group adult mean in group chick
## 16.76714 10.30522
Note that R conducts the Welch t-test in this case. This test does not assume equal variances (homoscedasticity) and works with an uneven number of replicates per group (the reason why R picked it), but it does rely on the assumption of normality of the data. The lm family functions, on the other hand, are sensitive to unequal variances among groups (heteroscedasticity) and require a somewhat similar number of replicates within each group, but assume a normal distribution of residuals instead of the data.
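A sketch of how the t-test relates to lm, on hypothetical toy data: with var.equal = TRUE, t.test reduces to the classic Student's t-test, which gives the same t statistic (up to sign) and p-value as the corresponding lm coefficient:

```r
# Hypothetical toy data: two groups with equal variances
set.seed(42)
toy <- data.frame(group = rep(c("a", "b"), each = 20),
                  y = c(rnorm(20, mean = 10), rnorm(20, mean = 12)))

tt <- t.test(y ~ group, data = toy, var.equal = TRUE) # classic Student's t-test
lmod <- summary(lm(y ~ group, data = toy)) # the same comparison as a linear model

# The t statistics agree up to sign, the p-values exactly
c(t.test = unname(tt$statistic),
  lm = lmod$coefficients["groupb", "t value"])
```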
We could also run a standard lm function to get an equivalent of the two-sample (= group, level) t-test, but such a test would require homoscedasticity and a similar number of replicates within both groups. Therefore, we skip it and proceed to testing the assumption of normality in the Welch t-test above. A quick but fairly impractical way to test normality is to run the Shapiro-Wilk test:
shapiro.test(ww$total_mass)
##
## Shapiro-Wilk normality test
##
## data: ww$total_mass
## W = 0.84855, p-value < 2.2e-16
The assumption of normality is not met. A far better way is to plot your data. Let’s see what causes the non-normality, and whether a logarithm transformation could help:
par(mfrow = c(2,2))
qqnorm(ww$total_mass)
hist(ww$total_mass)
qqnorm(log(ww$total_mass))
hist(log(ww$total_mass))
The logarithm transformation does help, even though there is one outlier that has to be taken care of. Let’s look at the data:
ggplot(ww, aes(x = age, y = total_mass)) + geom_boxplot()
ggplot(ww, aes(x = age, y = total_mass)) + geom_violin()
ggplot(ww, aes(x = age, y = total_mass)) + geom_jitter() + scale_y_log10()
ggplot(ww, aes(x = age, y = total_mass)) + geom_violin() + scale_y_log10()
There is a difference in total masses, and this difference becomes apparent after a logarithm transformation of the data. Because the data are still skewed, the “modern way” to solve the problem would be to run permutation tests. The “classic way” would be to remove the one outlier that causes the skewness in log space. The “dirty way” would be to just run a t-test on the logarithm-transformed data, because we see that there is a difference and we can provide a biological explanation for it. Whichever approach you choose, the most important thing is that you understand your data. If you do, your conclusions do not depend on your statistics, and accidentally breaking model assumptions does not become a “deadly sin”.
You can see how to do the “modern way” in the permutation tests section. For now, let’s just remove that outlier. It is caused by a stray sample that should not be in the dataset anyway.
tmp <- ww[-which.min(ww$total_mass),]
## The distributions look ok. There is still a difference
ggplot(tmp, aes(x = age, y = total_mass)) + geom_violin() + scale_y_log10()
qqnorm(log(tmp$total_mass)) ## qqplot looks okish
qqline(log(tmp$total_mass))
t.test(log(total_mass) ~ age, tmp)
##
## Welch Two Sample t-test
##
## data: log(total_mass) by age
## t = 5.7393, df = 245.91, p-value = 2.787e-08
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3085408 0.6309677
## sample estimates:
## mean in group adult mean in group chick
## 2.448341 1.978587
Running an ANOVA requires a factor with more than two levels as an explanatory variable:
table(ww$colony)
##
## Blomstrand Krykkjefjellet no
## 343 392 144
str(ww$colony)
## Factor w/ 3 levels "Blomstrand","Krykkjefjellet",..: 2 2 2 2 2 2 2 2 2 2 ...
summary(lm(total_mass ~ colony, data = ww))
##
## Call:
## lm(formula = total_mass ~ colony, data = ww)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.216 -9.253 -3.900 5.794 78.780
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.03588 0.72182 20.830 <2e-16 ***
## colonyKrykkjefjellet 1.18453 0.98840 1.198 0.231
## colonyno -0.01605 1.32744 -0.012 0.990
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.37 on 876 degrees of freedom
## Multiple R-squared: 0.001958, Adjusted R-squared: -0.0003202
## F-statistic: 0.8595 on 2 and 876 DF, p-value: 0.4237
The aov function is a wrapper for lm that produces a more traditional ANOVA table. Which one you use is up to you, as the underlying model is the same:
summary(aov(total_mass ~ colony, data = ww))
## Df Sum Sq Mean Sq F value Pr(>F)
## colony 2 307 153.6 0.859 0.424
## Residuals 876 156553 178.7
When we plot the data, we can see by eye that while there is no clear difference between the groups, the assumption of normality is likely violated.
ggplot(ww, aes(x = colony, y = total_mass)) + geom_boxplot()
Test assumptions:
gvlma(lm(total_mass ~ colony, data = ww))
##
## Call:
## lm(formula = total_mass ~ colony, data = ww)
##
## Coefficients:
## (Intercept) colonyKrykkjefjellet colonyno
## 15.03588 1.18453 -0.01605
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = lm(total_mass ~ colony, data = ww))
##
## Value p-value Decision
## Global Stat 1.036e+03 0.000e+00 Assumptions NOT satisfied!
## Skewness 4.316e+02 0.000e+00 Assumptions NOT satisfied!
## Kurtosis 5.676e+02 0.000e+00 Assumptions NOT satisfied!
## Link Function -1.082e-14 1.000e+00 Assumptions acceptable.
## Heteroscedasticity 3.680e+01 1.311e-09 Assumptions NOT satisfied!
Not satisfied. Try log-transforming:
gvlma(lm(log(total_mass) ~ colony, data = ww))
##
## Call:
## lm(formula = log(total_mass) ~ colony, data = ww)
##
## Coefficients:
## (Intercept) colonyKrykkjefjellet colonyno
## 2.33170 0.03628 0.02289
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = lm(log(total_mass) ~ colony, data = ww))
##
## Value p-value Decision
## Global Stat 9.438e+02 0.00000 Assumptions NOT satisfied!
## Skewness 1.980e+02 0.00000 Assumptions NOT satisfied!
## Kurtosis 7.396e+02 0.00000 Assumptions NOT satisfied!
## Link Function 2.113e-13 1.00000 Assumptions acceptable.
## Heteroscedasticity 6.201e+00 0.01277 Assumptions NOT satisfied!
Still not satisfied. We can try to tackle this type of data with nonparametric rank-based tests or with permutation tests.
Even though the dataset above may not be suitable for post hoc tests (comparing groups), your data may be. General post hoc routines in R can be cumbersome, but luckily there are shortcuts:
amod <- aov(total_mass ~ colony, data = ww)
TukeyHSD(amod)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = total_mass ~ colony, data = ww)
##
## $colony
## diff lwr upr p adj
## Krykkjefjellet-Blomstrand 1.1845263 -1.135933 3.504986 0.4544048
## no-Blomstrand -0.0160457 -3.132470 3.100379 0.9999194
## no-Krykkjefjellet -1.2005720 -4.258861 1.857717 0.6267498
library(multcomp)
cld(glht(amod, mcp(colony="Tukey")), level = 0.05)
## Blomstrand Krykkjefjellet no
## "a" "a" "a"
library(effects)
plot(effect("colony", amod))
E2.3 Examine whether the breeding stage (stage) influences the total mass of regurgitates using the tools you have learned so far during this course.
Nonparametric tests make no assumptions about the population distribution. These tests often have lower statistical power than parametric tests, but they are also more flexible and robust to outliers because they make fewer assumptions and often work on ranks, which are less dispersed than continuous values. As an example, nonparametric tests can be used when the response variable has a certain order (= rank, e.g. small, medium, large). Continuous response variables can also be transformed into ordered categorical variables if they do not behave when testing model assumptions. One group of nonparametric tests, rank tests, essentially converts the order of values to integers and assumes linear relationships with the explanatory variables. Therefore, these tests, while not linear in the sense of continuity, can also be thought of as linear models. Yet, nonparametric tests are diverse, and some of them are fairly complex; the generalization does not apply to all of them.
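The rank test idea can be sketched directly on hypothetical toy data: an ANOVA run on rank-transformed values asks roughly the same question as the Kruskal-Wallis test (the equivalence is approximate, not exact):

```r
set.seed(1)
toy <- data.frame(group = rep(c("a", "b", "c"), each = 15),
                  y = rexp(45)) # skewed, non-normal response

# Kruskal-Wallis test on the raw values
kruskal.test(y ~ group, data = toy)$p.value

# Roughly the same question asked as a linear model on ranks
anova(lm(rank(y) ~ group, data = toy))[["Pr(>F)"]][1]
```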
Here we try two of the best-known ones: the t-test analog, the (Mann-Whitney-)Wilcoxon test, and the ANOVA analog, the Kruskal(-Wallis) test. Yes, statisticians need a “taxonomer” to make the model names more descriptive…
wilcox.test(total_mass ~ age, ww)
##
## Wilcoxon rank sum test with continuity correction
##
## data: total_mass by age
## W = 75316, p-value = 1.272e-08
## alternative hypothesis: true location shift is not equal to 0
Age still affects total mass. The test concludes that the total masses for adults and chicks come from non-identical populations. The conclusion we drew from the graphs holds.
kruskal.test(total_mass ~ colony, data = ww)
##
## Kruskal-Wallis rank sum test
##
## data: total_mass by colony
## Kruskal-Wallis chi-squared = 1.2038, df = 2, p-value = 0.5478
There is no reason to reject the null hypothesis that the different colonies come from identical populations. Our conclusion from the graphs holds here too. Figures are better than simple statistical tests for understanding your data; (simple) stats are a method to communicate your results.
Permutation or resampling tests are based on the idea that original data are randomly resampled, and the test statistics are calculated from averaged differences of these resampled values. By repeating the resampling many times, one can create distributions for statistical tests. Consequently, permutation tests do not assume a distribution for the original data and are less sensitive to the assumption of equal variances. Yet, permutation tests are sensitive to outliers, especially for small sample sizes.
Permutation tests are a valid alternative when the assumptions of classical tests are not met, and they often have more power to detect differences than rank-based nonparametric tests. One reason why these tests are not more widely used instead of rank-based tests is that they are computationally intensive. Nevertheless, with modern computers and normal ecological datasets (i.e. not big data), these tests may be recommended over nonparametric tests when the dataset is sufficiently large. Small datasets (say n < 20 per group) are still better handled by nonparametric tests, because resampling requires a sufficiently large population to sample from.
Yet, as for any statistical test on bivariate data, the best way to detect differences is to plot your data using a variety of graph types and transformations. If you as a researcher think that the difference is something you can defend against critical colleagues, you may proceed to show the difference using statistics.
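The resampling idea itself fits in a few lines before reaching for packages. A sketch of a permutation test for a difference in group means, on hypothetical toy data (n.perm is an arbitrary choice):

```r
set.seed(7)
toy <- data.frame(group = rep(c("a", "b"), each = 30),
                  y = c(rexp(30, rate = 1), rexp(30, rate = 0.5)))

obs.diff <- diff(tapply(toy$y, toy$group, mean)) # observed difference in means

n.perm <- 5000
perm.diff <- replicate(n.perm, {
  shuffled <- sample(toy$group) # break the pairing between group and value
  diff(tapply(toy$y, shuffled, mean))
})

# Two-sided p-value: the proportion of shuffled differences
# at least as extreme as the observed one
mean(abs(perm.diff) >= abs(obs.diff))
```

The ready-made implementations below do essentially this, with more careful test statistics and faster algorithms.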
library(coin)
coin::wilcox_test(total_mass ~ age, data = ww)
##
## Asymptotic Wilcoxon-Mann-Whitney Test
##
## data: total_mass by age (adult, chick)
## Z = 5.69, p-value = 1.27e-08
## alternative hypothesis: true mu is not equal to 0
There is still reason to reject the null hypothesis.
library(lmPerm)
pmod <- lmPerm::aovp(total_mass ~ colony, data = ww)
## [1] "Settings: unique SS "
summary(pmod)
## Component 1 :
## Df R Sum Sq R Mean Sq Iter Pr(Prob)
## colony 2 307 153.60 1031 0.1319
## Residuals 876 156553 178.71
Still no reason to reject the null hypothesis.
E2.4 Show that the total mass of regurgitates tends to be lower during incubation compared to chick-rearing using non-parametric and permutation tests.
Any statistics program can produce simple linear models effortlessly. R shines when you want to repeat many tests and extract certain parameters to a table.
library(broom)
mod <- lm(polar_cod ~ ice, data = wide.dat)
broom::tidy(mod) ## Nice table
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 20.8 7.75 2.69 0.0156
## 2 ice 3.34 0.789 4.24 0.000557
broom::glance(mod) ## Nice table
## # A tibble: 1 x 11
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.513 0.485 20.8 17.9 5.57e-4 2 -83.5 173. 176.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
You can run several regression analyses at once through the pipe operators in dplyr. Press Ctrl + Shift + M in RStudio to get the pipe operator:
library(dplyr)
x <- long.dat[long.dat$variable %in% c("polar_cod", "capelin", "thy_ine", "the_lib"),] # subset the dataset
x <- droplevels(x) # remove unused variable levels
reg.pipe <- x %>% group_by(variable) %>% do(mod = lm(fo ~ ice, data = .))
tidy(reg.pipe, mod) ## All regression parameters
## # A tibble: 8 x 6
## # Groups: variable [4]
## variable term estimate std.error statistic p.value
## <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 polar_cod (Intercept) 20.8 7.75 2.69 0.0156
## 2 polar_cod ice 3.34 0.789 4.24 0.000557
## 3 capelin (Intercept) 16.5 7.22 2.29 0.0352
## 4 capelin ice -0.950 0.736 -1.29 0.214
## 5 thy_ine (Intercept) 30.1 8.19 3.68 0.00187
## 6 thy_ine ice -0.687 0.834 -0.823 0.422
## 7 the_lib (Intercept) 6.65 7.77 0.856 0.404
## 8 the_lib ice 1.10 0.792 1.40 0.181
glance(reg.pipe, mod) ## All R2 values
## # A tibble: 4 x 12
## # Groups: variable [4]
## variable r.squared adj.r.squared sigma statistic p.value df logLik AIC
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <dbl>
## 1 polar_c… 0.513 0.485 20.8 17.9 5.57e-4 2 -83.5 173.
## 2 capelin 0.0893 0.0357 19.4 1.67 2.14e-1 2 -82.2 170.
## 3 thy_ine 0.0383 -0.0182 22.0 0.678 4.22e-1 2 -84.6 175.
## 4 the_lib 0.103 0.0500 20.8 1.95 1.81e-1 2 -83.6 173.
## # … with 3 more variables: BIC <dbl>, deviance <dbl>, df.residual <int>
Combine to a table:
a <- data.frame(tidy(reg.pipe, mod)) ## All regression parameters
b <- data.frame(glance(reg.pipe, mod)) ## All R2 values
a.int <- subset(a, term == "(Intercept)")
a.sl <- subset(a, term == "ice")
out <- data.frame(variable = b$variable, int = a.int$estimate, sl = a.sl$estimate, se = a.sl$std.error, f = a.sl$statistic, p = a.sl$p.value, r2 = b$r.squared)
colnames(out) <- sub("(^[[:alpha:]])", "\\U\\1", colnames(out), perl=TRUE) ## Capitalize first letter
out
## Variable Int Sl Se F P R2
## 1 polar_cod 20.817940 3.3438030 0.7894446 4.2356399 0.0005571035 0.51346061
## 2 capelin 16.523989 -0.9501423 0.7359018 -1.2911265 0.2139414518 0.08930235
## 3 thy_ine 30.103558 -0.6868548 0.8343084 -0.8232625 0.4217580462 0.03833976
## 4 the_lib 6.649056 1.1047885 0.7917700 1.3953402 0.1808751829 0.10275911
Regular expressions to format variable names:
## Modify variable names
out$Variable <- as.character(out$Variable)
out$Variable <- sub("\\_", " ", out$Variable) ## Get rid of _
# One more way to replace variable names
out$Variable[out$Variable %in% grep("th", out$Variable, value = TRUE)] <- c("T. inermis", "T. libellula")
# Capitalize the first character in each word
out$Variable <- sub("(^[[:alpha:]])", "\\U\\1", out$Variable, perl=TRUE)
Format number of digits:
out$P <- sprintf(out$P, fmt='%.3f')
##OR: out$P <- round(out$P, 4); out$P <- format(out$P, scientific = F)
out[2:3] <- lapply(out[2:3], function(k) round(k, 1)) ### Round
out[c(4:5,7)] <- lapply(out[c(4:5,7)], function(k) round(k, 2))
Writing the table to an Excel file (which can then be copied into MS Word):
openxlsx::write.xlsx(out, file = "Table example.xlsx", rowNames = FALSE)
You can print nice HTML tables using the knitr and kableExtra packages:
kable(out, "html") %>% kable_styling(bootstrap_options = c("striped", "hover", "responsive"))
| Variable | Int | Sl | Se | F | P | R2 |
|---|---|---|---|---|---|---|
| Polar cod | 20.8 | 3.3 | 0.79 | 4.24 | 0.001 | 0.51 |
| Capelin | 16.5 | -1.0 | 0.74 | -1.29 | 0.214 | 0.09 |
| T. inermis | 30.1 | -0.7 | 0.83 | -0.82 | 0.422 | 0.04 |
| T. libellula | 6.6 | 1.1 | 0.79 | 1.40 | 0.181 | 0.10 |
The possibilities to style your html tables with knitr, kableExtra and formattable are almost endless, but the syntax takes some getting used to:
library(knitr)
library(kableExtra)
library(formattable)
library(dplyr)
out %>% mutate(
Variable = ifelse(grepl("T.", Variable),
cell_spec(Variable, italic = TRUE),
cell_spec(Variable, italic = FALSE)),
Sl = color_tile("lightblue", "red")(Sl),
P = ifelse(P < 0.05, cell_spec(P, bold = TRUE),
cell_spec(P, bold = FALSE)),
R2 = color_bar("lightgreen")(R2)
) %>%
kable("html", escape = FALSE, align = "l") %>%
kable_styling(bootstrap_options = c("hover", "striped", "bordered"))
| Variable | Int | Sl | Se | F | P | R2 |
|---|---|---|---|---|---|---|
| Polar cod | 20.8 | 3.3 | 0.79 | 4.24 | 0.001 | 0.51 |
| Capelin | 16.5 | -1.0 | 0.74 | -1.29 | 0.214 | 0.09 |
| T. inermis | 30.1 | -0.7 | 0.83 | -0.82 | 0.422 | 0.04 |
| T. libellula | 6.6 | 1.1 | 0.79 | 1.40 | 0.181 | 0.10 |
E1.1
plot(x = wide.dat$ice, y = wide.dat$polar_cod)
E1.2
plot(wide.dat$ice, wide.dat$polar_cod, xlab = "Sea ice index", ylab = "Polar cod frequency of occurrence (%)", col = "red")
E1.3
p <- ggplot(x, aes(x = year, y = fo, fill = variable)) +
geom_col() +
facet_wrap(~variable, ncol = 1) +
scale_y_continuous(name = "Frequency of occurrence (%)", limits = c(0,100)) +
scale_x_continuous(name = "Year", breaks = seq(1980,2020, 5)) +
theme_bw(base_size = 8) +
theme(legend.position = "none")
ggsave("ggplot_exercise.pdf", plot = p, width = 84, height = 100, units = "mm")
E2.2
ggplot(wide.dat, aes(x = ice, y = capelin)) +
geom_point() + geom_smooth(se = FALSE, method = "lm") # These data will not pass the assumptions
ggplot(wide.dat, aes(x = ice, y = log10(capelin + 1))) + geom_point() # better
mod <- lm(log10(capelin + 1) ~ ice, data = wide.dat)
summary(mod) # The slope is significant in log-space
##
## Call:
## lm(formula = log10(capelin + 1) ~ ice, data = wide.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.89143 -0.32988 -0.02064 0.32035 1.21335
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.92675 0.20423 4.538 0.000291 ***
## ice -0.05246 0.02081 -2.521 0.021982 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5476 on 17 degrees of freedom
## Multiple R-squared: 0.2721, Adjusted R-squared: 0.2293
## F-statistic: 6.355 on 1 and 17 DF, p-value: 0.02198
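Because the model is fitted on the log10 scale, the slope is easier to interpret after back-transformation. A minimal sketch, assuming `mod` from above is still in the workspace:

```r
## The ice coefficient (-0.05246) is the change in log10(capelin + 1) per unit ice.
## Back-transforming gives the multiplicative effect on the original scale:
slope <- coef(mod)[["ice"]]
10^slope ## ~0.886, i.e. capelin + 1 decreases by ~11% per unit increase in the ice index
```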
par(mfrow = c(2,2))
plot(mod) # Not perfect, but probably good enough
gvlma(mod) # assumptions are ok
##
## Call:
## lm(formula = log10(capelin + 1) ~ ice, data = wide.dat)
##
## Coefficients:
## (Intercept) ice
## 0.92675 -0.05246
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = mod)
##
## Value p-value Decision
## Global Stat 2.796e+00 0.5926 Assumptions acceptable.
## Skewness 4.002e-01 0.5270 Assumptions acceptable.
## Kurtosis 5.420e-07 0.9994 Assumptions acceptable.
## Link Function 2.072e-01 0.6490 Assumptions acceptable.
## Heteroscedasticity 2.188e+00 0.1391 Assumptions acceptable.
# Let's plot the relationship (red = default loess smoother, blue = linear fit):
ggplot(wide.dat, aes(x = ice, y = log10(capelin + 1))) +
geom_smooth(se = FALSE, color = "red") +
geom_smooth(method = "lm", color = "blue") +
geom_point()
E2.3
ggplot(ww, aes(x = stage, y = total_mass)) + geom_violin() # The distributions are tailed
ggplot(ww, aes(x = stage, y = log10(total_mass))) + geom_jitter() # Still tailed. "rearing" appears higher than "incubation"
# The assumptions for t-test or lm cannot be satisfied for these data. There seems to be one outlier. Let's remove that and see whether we can fulfill the assumptions
tmp <- ww[log10(ww$total_mass) > -1,]
ggplot(tmp, aes(x = stage, y = log10(total_mass))) + geom_violin() # Looks better, but still probably not good enough. "rearing" appears higher than "incubation"
gvlma(lm(log10(total_mass) ~ stage, tmp)) # Skewness and kurtosis still fail, so these data call for a non-parametric approach, although the heteroscedasticity test (often the most critical one) passes. Let's try parametric approaches anyway
##
## Call:
## lm(formula = log10(total_mass) ~ stage, data = tmp)
##
## Coefficients:
## (Intercept) stagerearing
## 0.8763 0.1776
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = lm(log10(total_mass) ~ stage, tmp))
##
## Value p-value Decision
## Global Stat 1.065e+02 0.000e+00 Assumptions NOT satisfied!
## Skewness 8.116e+01 0.000e+00 Assumptions NOT satisfied!
## Kurtosis 2.398e+01 9.757e-07 Assumptions NOT satisfied!
## Link Function 1.097e-14 1.000e+00 Assumptions acceptable.
## Heteroscedasticity 1.364e+00 2.429e-01 Assumptions acceptable.
t.test(log10(total_mass) ~ stage, tmp) # rearing is higher than incubation
##
## Welch Two Sample t-test
##
## data: log10(total_mass) by stage
## t = -4.8134, df = 205.75, p-value = 2.869e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.2504000 -0.1048791
## sample estimates:
## mean in group incubation mean in group rearing
## 0.8762843 1.0539239
summary(lm(log10(total_mass) ~ stage, tmp))
##
## Call:
## lm(formula = log10(total_mass) ~ stage, data = tmp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7443 -0.2325 0.0487 0.2932 0.9238
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.87628 0.03495 25.070 < 2e-16 ***
## stagerearing 0.17764 0.03818 4.653 3.77e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4165 on 876 degrees of freedom
## Multiple R-squared: 0.02412, Adjusted R-squared: 0.02301
## F-statistic: 21.65 on 1 and 876 DF, p-value: 3.774e-06
Looking at the plots, we can conclude that rearing probably has a higher mean and that the two levels likely come from different (statistical) populations, even though we cannot make all model assumptions pass. The general result makes sense: during incubation the parents only have to find food for themselves, whereas during rearing they also have to feed the chick. A non-parametric approach is required to show this statistically.
E2.4
stats::wilcox.test(total_mass ~ stage, ww) # A strong argument to reject the null hypothesis
##
## Wilcoxon rank sum test with continuity correction
##
## data: total_mass by stage
## W = 38856, p-value = 1.158e-06
## alternative hypothesis: true location shift is not equal to 0
coin::wilcox_test(total_mass ~ stage, data = ww) # A strong argument to reject the null hypothesis
##
## Asymptotic Wilcoxon-Mann-Whitney Test
##
## data: total_mass by stage (incubation, rearing)
## Z = -4.8628, p-value = 1.157e-06
## alternative hypothesis: true mu is not equal to 0
ww %>% group_by(stage) %>% summarise(mean = mean(total_mass), sd = sd(total_mass)) # Rearing is higher
## # A tibble: 2 x 3
## stage mean sd
## <fct> <dbl> <dbl>
## 1 incubation 10.7 8.86
## 2 rearing 16.5 13.9
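Because the Wilcoxon test is rank-based, medians and interquartile ranges summarise the groups more naturally than means and standard deviations. A sketch along the same lines, assuming `ww` and dplyr are still loaded:

```r
## Rank-based summaries to accompany a rank-based test
ww %>% group_by(stage) %>%
  summarise(median = median(total_mass), iqr = IQR(total_mass))
## Or in base R:
tapply(ww$total_mass, ww$stage, median)
```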